Carnegie Mellon University - Tepper School of Business¶

46-886 - Machine Learning Fundamentals¶

22/23 - Mini 4 - Section M4 - Farahat¶


Deliverable 3¶

Due Wednesday, April 12th at 2:59 a.m. Pittsburgh time

This is a team assignment. Please provide below the names of all team members who contributed to this deliverable:

  • TEAM 14
  • Name of team member 1: Thomas Gregg
  • Name of team member 2: Michelle Li
  • Name of team member 3: Suhas Narendrula
  • Name of team member 4: Shaun Pfister
  • Name of team member 5: Rodolfo Lerma

This deliverable consists of one problem with several parts. Each part requires you to write Python code and answer one or more questions. Save a copy of this notebook and provide your solution by inserting code or Markdown text cells as appropriate.

You will need to submit two files through Canvas: 1) a completed Jupyter notebook (extension .ipynb) with your code and answers to each question (this file); and 2) an html rendering of your completed Jupyter notebook. Only one team member needs to submit the files on Canvas on behalf of the rest of the team.

Problem 1 - Predicting House Prices in Ames, Iowa (100 points)¶

We consider the Ames, Iowa Housing Prices dataset, which describes sales of 2,818 properties in the town of Ames, Iowa from 2006 to 2010[1]. Work with the dataset provided in the ames.csv file. This file has been pre-processed to simplify the analysis. The file contains 73 variables, described on the last page of this handout. The first variable characterizes the property's sale price -- which we aim to predict. The other variables describe the property in detail, both in quantitative terms (square footage, size of the lot, number of rooms, date of construction, etc.) and qualitative terms (building materials, style of the home, etc.).

1 De Cock, Dean. “Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project.” Journal of Statistics Education, 19.3 (2011).

In this problem, you will compare three predictive analytics methods: linear regression, CART, and random forest.

First, import the dataset and create an 80-20 train-test split. Use 886 as the random seed throughout your code.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
import plotly.express as px
import plotly.graph_objects as go
In [2]:
#Data Exploration
data = pd.read_csv("ames.csv")
In [3]:
data.head()
Out[3]:
SalePrice MSZoning LotFrontage LotArea Street Alley LotShape LandContour LotConfig LandSlope ... EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MoSold YrSold SaleType
0 215000 RL 141.0 31770 Pave No Alley IR1 Lvl Corner Gtl ... 0 0 0 0 No Pool No Fence No Feature 5 2010 Warranty Deed
1 105000 Other 80.0 11622 Pave No Alley Reg Lvl Inside Gtl ... 0 0 120 0 No Pool MnPrv No Feature 6 2010 Warranty Deed
2 172000 RL 81.0 14267 Pave No Alley IR1 Lvl Corner Gtl ... 0 0 0 0 No Pool No Fence Feature 6 2010 Warranty Deed
3 244000 RL 93.0 11160 Pave No Alley Reg Lvl Corner Gtl ... 0 0 0 0 No Pool No Fence No Feature 4 2010 Warranty Deed
4 189900 RL 74.0 13830 Pave No Alley IR1 Lvl Inside Gtl ... 0 0 0 0 No Pool MnPrv No Feature 3 2010 Warranty Deed

5 rows × 73 columns

In [4]:
import plotly.express as px
import numpy as np

fig = px.histogram(data, 
                   x="SalePrice", 
                   title="Sales Price Distribution",
                   nbins=50)

# Make the bar plot gray
fig.update_traces(marker=dict(color='gray'))

# update the plot background color to transparent and enlarge the fonts
fig.update_layout(
    plot_bgcolor='rgba(0,0,0,0)',
    title=dict(font=dict(size=28)),
    yaxis=dict(titlefont=dict(size=18), tickfont=dict(size=14)),
    xaxis=dict(titlefont=dict(size=18), tickfont=dict(size=14)))

# Calculate the median of the 'SalePrice' column
median_sale_price = np.median(data['SalePrice'])

# Add a vertical dotted line to represent the median
fig.add_shape(
    type='line',
    x0=median_sale_price,
    x1=median_sale_price,
    y0=0,
    y1=1,
    yref='paper',
    line=dict(color='crimson', dash='dot'),
)

# Add an annotation for the median price
fig.add_annotation(
    x=median_sale_price,
    y=1.1,  # Adjust the y value to make the note a little bit higher
    yref='paper',
    yanchor='top',
    text="<b>Median Price</b>",
    showarrow=False,
    font=dict(size=14, color="crimson"),
    bgcolor='rgba(255, 255, 255, 0.7)',
    bordercolor='rgba(255, 255, 255, 0)',
    borderpad=4,
    xshift=5
)

fig.show()
In [5]:
# Replace CentralAir column with binary values
data['CentralAir'] = data['CentralAir'].map({'Y': 1, 'N': 0})
In [6]:
data = pd.get_dummies(data, drop_first=True)
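pd.get_dummies with drop_first=True one-hot encodes every remaining categorical column but drops the first level of each, which avoids perfect collinearity among the dummies when we later fit the linear regression. A minimal sketch on a toy column (the values are borrowed from the data preview above; the toy frame is illustrative):

```python
import pandas as pd

# Toy frame with a single categorical column
toy = pd.DataFrame({"Fence": ["No Fence", "MnPrv", "No Fence"]})

# drop_first=True drops the alphabetically first level ("MnPrv"),
# leaving one 0/1 indicator instead of two redundant ones
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())  # ['Fence_No Fence']
```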
In [7]:
# Train-test split
X = data.drop("SalePrice", axis=1)
y = data["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=886)

Part (a) [15pts]¶

Construct a linear regression model on the training set. You can use all the variables in the dataset (that is, do not worry about issues of multi-collinearity or variable selection). What are the in-sample and out-of-sample R-squared for this model?

In [8]:
# Linear regression
lr = LinearRegression()
lr.fit(X_train, y_train)

# In-sample and out-of-sample R-squared
lr_train_r2 = r2_score(y_train, lr.predict(X_train))
lr_test_r2 = r2_score(y_test, lr.predict(X_test))

print("Linear Regression In-Sample R-squared:", lr_train_r2)
print("Linear Regression Out-of-Sample R-squared:", lr_test_r2)
Linear Regression In-Sample R-squared: 0.9015163293064694
Linear Regression Out-of-Sample R-squared: 0.8639768388522663
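With roughly 187 predictors after one-hot encoding, the plain in-sample R-squared is inflated simply by the number of features. As a side check (not required by the assignment), a sketch of adjusted R-squared using the in-sample figure above, assuming n = 2,254 training rows (80% of 2,818) and p = 187 predictors:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: penalizes R-squared for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# In-sample R-squared from above, n = training rows, p = predictors
print(round(adjusted_r2(0.9015, 2254, 187), 4))  # ≈ 0.8926
```

The adjustment only shaves off about a point of R-squared here, so the linear fit is not purely an artifact of the wide design matrix.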

Part (b) [15pts]¶

Train a CART model with max_depth set to 3 and without changing any other default parameters. Use a random_state value equal to 886.

In [9]:
# Train a CART model
cart = DecisionTreeRegressor(max_depth=3, random_state=886)
cart.fit(X_train, y_train)
Out[9]:
DecisionTreeRegressor(max_depth=3, random_state=886)
In [10]:
# In-sample and out-of-sample R-squared
cart_train_r2 = r2_score(y_train, cart.predict(X_train))
cart_test_r2 = r2_score(y_test, cart.predict(X_test))

print("CART In-Sample R-squared:", cart_train_r2)
print("CART Out-of-Sample R-squared:", cart_test_r2)
CART In-Sample R-squared: 0.693343377808766
CART Out-of-Sample R-squared: 0.6398765664946515

Part (c) [10pts]¶

A friend of yours is looking to sell her home in Ames, Iowa and has called you for advice. The house does not have a central air conditioning system. She is wondering whether it would be worth having one installed in order to increase the value of her home (at a cost of $15,000). What would your recommendation be based on your CART model, and what does this say about the model?

In [11]:
#Explore the distribution of the variable
data['CentralAir'].value_counts()
Out[11]:
1    2652
0     166
Name: CentralAir, dtype: int64

Option 1: Comparing values directly¶

In [12]:
import pandas as pd

# Assuming 'data' is your preprocessed dataset with one-hot encoding
data_with_ac = data[data['CentralAir'] == 1]
data_without_ac = data[data['CentralAir'] == 0]

avg_price_with_ac = data_with_ac['SalePrice'].mean()
avg_price_without_ac = data_without_ac['SalePrice'].mean()

price_difference = avg_price_with_ac - avg_price_without_ac

print("Average price with AC:", avg_price_with_ac)
print("Average price without AC:", avg_price_without_ac)
print("Average price difference:", price_difference)

if price_difference > 15000:
    print("Recommendation: Installing central air conditioning generally increases house value by more than $15,000.")
else:
    print("Recommendation: Installing central air conditioning generally does not increase house value by more than $15,000.")
Average price with AC: 182509.21681749623
Average price without AC: 110464.01807228915
Average price difference: 72045.19874520708
Recommendation: Installing central air conditioning generally increases house value by more than $15,000.

Option 2: Using the model on the same data with the CentralAir variable set manually¶

In [13]:
# Assuming 'X_test' is the out-of-sample dataset with one-hot encoding
X_test_without_ac = X_test.copy()
X_test_with_ac = X_test.copy()

X_test_without_ac['CentralAir'] = 0
X_test_with_ac['CentralAir'] = 1
In [14]:
# Assuming 'cart_model' is the previously trained CART model
predictions_without_ac = cart.predict(X_test_without_ac)
predictions_with_ac = cart.predict(X_test_with_ac)
In [15]:
avg_prediction_without_ac = np.mean(predictions_without_ac)
avg_prediction_with_ac = np.mean(predictions_with_ac)

prediction_difference = avg_prediction_with_ac - avg_prediction_without_ac

print("Average predicted price without AC:", avg_prediction_without_ac)
print("Average predicted price with AC:", avg_prediction_with_ac)
print("Average predicted price difference:", prediction_difference)

if prediction_difference > 15000:
    print("Recommendation: Installing central air conditioning generally increases house value by more than $15,000.")
else:
    print("Recommendation: Installing central air conditioning generally does NOT increase house value by more than $15,000.")
Average predicted price without AC: 178108.12312136224
Average predicted price with AC: 178108.12312136224
Average predicted price difference: 0.0
Recommendation: Installing central air conditioning generally does NOT increase house value by more than $15,000.

Option 3: Feature importance¶

In [16]:
importances = pd.Series(cart.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print("Feature importances:\n", importances)
Feature importances:
 ExterQual_TA              0.571090
GarageCars                0.202964
GrLivArea                 0.083328
X1stFlrSF                 0.062334
BsmtQual_Gd               0.037302
                            ...   
Neighborhood_SWISU        0.000000
Neighborhood_Sawyer       0.000000
Neighborhood_SawyerW      0.000000
Neighborhood_Somerst      0.000000
SaleType_Warranty Deed    0.000000
Length: 187, dtype: float64
In [17]:
import plotly.graph_objects as go

# Get the first 15 variables in order of feature importance
top_15_importances = importances.head(15)

# Create a custom color list
colors = ['crimson' if i < 2 else 'gray' for i in range(len(top_15_importances))]

# Create a bar plot
fig = go.Figure()

fig.add_trace(go.Bar(
    x=top_15_importances.index,
    y=top_15_importances.values,
    marker_color=colors,
    marker_line_width = 0 # Remove the line around the bars
))

# Customize the plot
fig.update_layout(
    title={'text': 'Top 15 Feature Importances', 'font': {'size': 24}},
    yaxis_title='Importance',
    xaxis_tickangle=45,
    xaxis_title=None,
    plot_bgcolor='rgba(0,0,0,0)',
    yaxis=dict(titlefont=dict(size=18), tickfont=dict(size=14)),
    xaxis=dict(tickfont=dict(size=14))
)

# Add a note to the center right of the plot
fig.add_annotation(
    x=0.95,  # Adjust the x value to position the note at the center right of the plot
    y=0.5,
    xref='paper',
    yref='paper',
    text="The first 2 variables account <br> for more than <b>75%</b> of the model's total feature importance",
    showarrow=False,
    font=dict(size=14, color="black"),
    bgcolor='rgba(255, 255, 255, 0.7)',
    bordercolor='rgba(255, 255, 255, 0)',
    borderpad=4,
    xanchor='right'
)


fig.show()
In [18]:
central_air_importance = importances["CentralAir"]
print("CentralAir importance:", central_air_importance)
CentralAir importance: 0.0

The feature importance of the CentralAir variable is 0.0, which means the fitted tree never splits on it: with max_depth=3 the tree has at most seven internal splits, and all of them go to stronger predictors. Based on this data and model, the presence or absence of central air conditioning has no effect on the predicted value of a home. Note that feature importance can vary with the dataset and the model used; in this case, our model simply does not find central air conditioning to be a key factor in determining home prices.
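One way to confirm this directly is to print the fitted tree's split rules and check that CentralAir never appears. A self-contained sketch on synthetic data (the column names size and has_ac are illustrative, not from ames.csv; on the real data you would call export_text on the cart model with X.columns):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(886)
n = 500

# 'size' drives the target strongly; 'has_ac' adds only a small effect
X_demo = pd.DataFrame({
    "size": rng.uniform(800, 3000, n),
    "has_ac": rng.integers(0, 2, n),
})
y_demo = 100 * X_demo["size"] + 500 * X_demo["has_ac"] + rng.normal(0, 5000, n)

tree_demo = DecisionTreeRegressor(max_depth=3, random_state=886).fit(X_demo, y_demo)
rules = export_text(tree_demo, feature_names=list(X_demo.columns))
print(rules)

# A shallow tree spends its few splits on the dominant feature,
# so the weak feature never appears in the printed rules
print("has_ac used in a split:", "has_ac" in rules)
```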

Prediction using an example for the test dataset¶

In [19]:
# Get the index of the selected home
home_index = X_test.iloc[10].name

# Get the actual value of the home
actual_value = y_test.loc[home_index]
print("Actual value:", actual_value)
Actual value: 186800
In [20]:
# Predict home value without central air conditioning
home_data = X_test.iloc[10].copy(deep=True)  # Replace this with the actual data for your friend's home
home_data["CentralAir"] = 0
home_data_df = pd.DataFrame(home_data).T
home_data_df.columns = X_test.columns
predicted_value_without_ac = cart.predict(home_data_df)[0]

# Predict home value with central air conditioning
home_data["CentralAir"] = 1
home_data_df = pd.DataFrame(home_data).T
home_data_df.columns = X_test.columns
predicted_value_with_ac = cart.predict(home_data_df)[0]

print("Predicted value without central air conditioning:", predicted_value_without_ac)
print("Predicted value with central air conditioning:", predicted_value_with_ac)

price_difference = predicted_value_with_ac - predicted_value_without_ac
print("Price difference:", price_difference)
Predicted value without central air conditioning: 190014.5353773585
Predicted value with central air conditioning: 190014.5353773585
Price difference: 0.0

We can observe that the CentralAir variable has no predictive power in our current model: because the tree never splits on it, the model predicts the same price whether or not the home has central air conditioning.

Part (d) [25pts]¶

Use 10-fold cross-validation to select a good max_depth value for your tree and then fit your final model. Consider max_depth values from 2 to 10. What are the in-sample and out-of-sample R-squared for the selected model (fitted on the entire train set)? Use "r2" as the scoring method.

Option 1¶

In [21]:
# Define the parameter grid for max_depth
param_grid = {'max_depth': list(range(2, 11))}

# Create a DecisionTreeRegressor with random_state=886
tree = DecisionTreeRegressor(random_state=886)

# Create a GridSearchCV object with 10-fold cross-validation
grid_search = GridSearchCV(tree, param_grid, cv=10, scoring='r2', n_jobs=-1)

# Fit the GridSearchCV object on the training data
grid_search.fit(X_train, y_train)

# Get the best max_depth value
best_max_depth = grid_search.best_params_['max_depth']
print("Best max_depth:", best_max_depth)

# Fit the final model using the best max_depth value on the entire training set
final_tree = DecisionTreeRegressor(max_depth=best_max_depth, random_state=886)
final_tree.fit(X_train, y_train)

# In-sample and out-of-sample R-squared
final_tree_train_r2 = r2_score(y_train, final_tree.predict(X_train))
final_tree_test_r2 = r2_score(y_test, final_tree.predict(X_test))

print(f"In-sample R-squared: {final_tree_train_r2:.4f}")
print(f"Out-of-sample R-squared: {final_tree_test_r2:.4f}")
Best max_depth: 7
In-sample R-squared: 0.9088
Out-of-sample R-squared: 0.7545

Option 2¶

In [22]:
max_depth_values = list(range(2, 11))
cv_scores = []

for max_depth in max_depth_values:
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=886)
    scores = cross_val_score(tree, X_train, y_train, cv=10, scoring='r2')
    cv_scores.append(np.mean(scores))

best_max_depth = max_depth_values[np.argmax(cv_scores)]
print("Best max_depth:", best_max_depth)

# Fit the final model with the selected max_depth
final_tree_model = DecisionTreeRegressor(max_depth=best_max_depth, random_state=886)
final_tree_model.fit(X_train, y_train)

# In-sample and out-of-sample R-squared
in_sample_r2 = final_tree_model.score(X_train, y_train)
out_of_sample_r2 = final_tree_model.score(X_test, y_test)

print(f"In-sample R-squared: {in_sample_r2:.4f}")
print(f"Out-of-sample R-squared: {out_of_sample_r2:.4f}")
Best max_depth: 7
In-sample R-squared: 0.9088
Out-of-sample R-squared: 0.7545
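The gap between the in-sample R-squared (0.9088) and the out-of-sample R-squared (0.7545) reflects the overfitting that the cross-validation guards against, and it widens as max_depth grows. A standalone sketch of the same depth sweep that also records the train-set score, on synthetic data (make_regression is a stand-in for X_train/y_train; on the real data, reuse the loop above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X_train / y_train so the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=800, n_features=20, noise=20,
                                 random_state=886)

depths = list(range(2, 11))
scores_cv, scores_train = [], []
for d in depths:
    tree = DecisionTreeRegressor(max_depth=d, random_state=886)
    scores_cv.append(cross_val_score(tree, X_demo, y_demo, cv=10, scoring="r2").mean())
    scores_train.append(tree.fit(X_demo, y_demo).score(X_demo, y_demo))

# The train score keeps climbing while the CV score plateaus or falls:
# the widening gap is the overfitting signal
for d, cv, tr in zip(depths, scores_cv, scores_train):
    print(f"max_depth={d}: train R2={tr:.3f}, CV R2={cv:.3f}, gap={tr - cv:.3f}")

print("Best depth by CV on this synthetic data:", depths[int(np.argmax(scores_cv))])
```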

Part (e) [25pts]¶

Construct a random forest model with 1000 trees. Leave the other hyperparameters at their default values. What are the in-sample and out-of-sample R-squared for this model? Again, use a random_state equal to 886.

In [23]:
rf = RandomForestRegressor(n_estimators=1000, random_state=886)
rf.fit(X_train, y_train)

y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)

in_sample_r2 = r2_score(y_train, y_train_pred)
out_of_sample_r2 = r2_score(y_test, y_test_pred)

print(f"In-sample R-squared: {in_sample_r2:.4f}")
print(f"Out-of-sample R-squared: {out_of_sample_r2:.4f}")
In-sample R-squared: 0.9836
Out-of-sample R-squared: 0.8795
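Because each tree in a random forest is trained on a bootstrap sample, the forest can also provide a free out-of-bag (OOB) estimate of generalization via oob_score=True, without touching the test set. A sketch on synthetic data (for regressors, oob_score_ is the OOB R-squared; on ames.csv you would fit on X_train/y_train instead):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=10,
                                 random_state=886)

# oob_score=True scores each sample using only the trees that did not see it
rf_demo = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=886)
rf_demo.fit(X_demo, y_demo)

print("OOB R-squared:", round(rf_demo.oob_score_, 4))
```

The OOB estimate typically tracks the held-out test R-squared closely, which makes it a handy sanity check when data is scarce.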

Part (f) [10pts]¶

Which of the four models you constructed in parts (a), (b), (d), and (e) would you recommend? Make sure to justify your choice and to discuss the strengths and limitations of the models.

In [24]:
results = {}
models = {'Linear Regression': lr, 'CART (max_depth=3)': cart, 'CART (best max_depth)': final_tree_model, 'Random Forest': rf}

for name, model in models.items():
    y_test_pred = model.predict(X_test)
    out_of_sample_r2 = r2_score(y_test, y_test_pred)
    results[name] = out_of_sample_r2
    print(f"{name} Out-of-sample R-squared: {out_of_sample_r2:.4f}")
Linear Regression Out-of-sample R-squared: 0.8640
CART (max_depth=3) Out-of-sample R-squared: 0.6399
CART (best max_depth) Out-of-sample R-squared: 0.7545
Random Forest Out-of-sample R-squared: 0.8795
In [25]:
import plotly.graph_objects as go

# create a list of colors for the bars
colors = ['gray'] * 3 + ['crimson']

# create a bar trace
bar_trace = go.Bar(x=list(results.keys()), 
                   y=list(results.values()),
                   marker_color=colors,
                   text=[f"{val:.2f}" for val in results.values()],
                   textposition='auto',
                   textfont=dict(size=24))

# create a layout with a larger title and no background
layout = go.Layout(title=dict(text='Out of Sample Performance', font=dict(size=28)),
                   plot_bgcolor='rgba(0,0,0,0)',
                   #xaxis_tickangle=-45,
                   yaxis=go.layout.YAxis(title=None, tickvals=[]))

# create a figure with the bar trace and layout
fig = go.Figure(data=[bar_trace], layout=layout)

# update the x-axis tick font size
fig.update_xaxes(tickfont=dict(size=18))

# show the figure
fig.show()

Based on the out-of-sample R-squared values calculated previously, we can assess the performance of the models and consider other factors such as model complexity, interpretability, and training time.

  • Linear Regression: This model is simple and easy to interpret. However, it may not capture complex relationships between the features and the target variable as effectively as other models.
  • CART (max_depth=3): This model is more complex than linear regression and can capture non-linear relationships. It is also fairly interpretable, but it may suffer from over-simplification due to the shallow depth.
  • CART (best max_depth): This model is a more flexible decision tree, as it is trained with the best max_depth found using cross-validation. While it may perform better than the previous two models, it may be harder to interpret as the tree depth increases.
  • Random Forest: This model usually has the best predictive performance among the four, as it averages many decorrelated trees and thereby reduces variance. Its main drawbacks are interpretability (there is no single tree to inspect) and training time (here, 1000 trees).

If you are recommending a model to a friend who does not understand machine learning, it is better to choose a model that is simpler, more interpretable, and easier to explain. In this case, I would recommend the Linear Regression model.

The Linear Regression model has an out-of-sample R-squared of 0.8640, which is a good performance compared to the other models, and it is not far behind the Random Forest model's R-squared of 0.8795. Although the Random Forest model has better performance, its complexity and difficulty in interpretation can be challenging for someone without a background in machine learning.

END¶